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Abstract 

Multidimensional  scaling  is  widely  used  to  handle  data  which  consist  of  dissimilarity  measures  be¬ 
tween  pairs  of  objects  or  people.  We  deal  with  two  major  problems  in  metric  multidimensional 
scaling  —  configuration  of  objects  and  determination  of  the  dimension  of  object  configuration  — 
within  a  Bayesian  framework.  A  Markov  chain  Monte  Carlo  algorithm  is  proposed  for  object  con¬ 
figuration,  along  with  a  simple  Bayesian  criterion  for  choosing  their  effective  dimension,  called 
MDSIC.  Simulation  results  are  presented,  as  well  as  examples  on  real  data.  Our  method  provides 
better  results  than  classical  multidimensional  scaling  for  object  configuration,  and  MDSIC  seems 
to  work  well  for  dimension  choice  in  the  examples  considered. 

Key  Words:  Clustering,  Dimensionality,  Dissimilarity,  Markov  chain  Monte  Carlo,  Metric 
scaling,  Model  selection. 
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1  Introduction 


Multidimensional  scaling  (MDS)  is  concerned  with  data  that  are  given  as  dissimilarity  measures 
between  pairs  of  objects  or  individuals.  Its  goal  is  to  represent  the  objects  or  individuals  by  points 
in  a  (usually)  Euclidean  space. 

MDS  has  its  roots  in  psychology,  specifically  psychophysics,  being  based  on  the  analogy  between 
the  psychological  concept  of  similarity,  and  the  geometrical  concept  of  distance.  Two  individuals 
are  viewed  as  similar  if  they  tend  to  have  similar  responses  to  the  same  stimuli.  Subsequently,  it  has 
been  widely  used  in  other  social  and  behavioral  sciences.  Recently,  interest  in  MDS  has  increased 
further,  due  to  its  usefulness  in  some  currently  rapidly  developed  subjects,  such  as  genomics  (Tib- 
shirani  et  al.  1999),  and  information  retrieval  for  the  Web  and  other  document  databases  (Schutze 
and  Silverstein,  1997). 

One  of  the  main  applications  of  MDS  is  visualization,  where  the  user  wants  to  represent  a 
complex  set  of  dissimilarities  in  a  form  that  is  easier  to  see.  One  reason  for  this  is  to  see  if  visually 
apparent  clusters  are  present  in  the  data.  Another  application  is  exploration,  where  the  user  wants 
to  understand  what  the  main  dimensions  underlying  the  dissimilarities  are.  For  example,  the 
objects  in  MDS  might  be  political  candidates,  and  the  data  might  consist  of  subjective  similarity 
judgements.  MDS  might  help  to  suggest  which  political  positions  or  characteristics  are  important 
in  forming  similarity  judgements  (e.g.  position  on  Social  Security,  age,  tendency  to  tell  jokes).  A 
third  application  is  hypothesis  testing.  Monogaphs  on  MDS  include  Davison  (1983),  Young  (1987), 
Cox  and  Cox  (1994),  and  Borg  and  Groenen  (1997). 

An  important  issue  in  MDS  is  configuration  of  objects,  i.e. ,  estimation  of  values  for  attributes  of 
objects.  A  commonly  used  MDS  method  for  pairwise  dissimilarity  data  was  developed  by  Torgerson 
(1952,  1958).  Object  configurations  are  easy  to  compute  with  this  method,  now  called  classical 
MDS  (CMDS).  It  gives  complete  recovery  (up  to  location  shift)  of  object  configurations  when 
the  given  dissimilarites  are  exactly  equal  to  the  Euclidean  distances  and  when  the  dimension  is 
correctly  specified.  In  many  practical  situations,  however,  there  are  measurement  errors  in  the 
observed  dissimilarities  and  no  clear  notion  of  dimension.  Maximum  likelihood  MDS  methods  have 
been  developed  for  handling  measurement  errors  —  see,  for  instance,  Ramsay  (1982,  1991),  Takane 
(1982),  Takane  and  Carroll  (1981),  MacKay  (1989),  MacKay  and  Zinnes  (1991),  Groenen  (1993), 
and  Groenen,  Mathar,  and  Heisser  (1995).  However,  justification  of  maximum  likelihood  relies  on 
asymptotic  theory  and  computation  requires  nonlinear  optimization.  The  number  of  parameters 
to  be  optimized  over  typically  grows  at  a  faster  than  linear  rate  relative  to  the  dimension,  so  that 
the  asymptotic  theory  may  not  apply  in  high  dimensions,  as  pointed  out  by  Cox  (1982).  Moreover, 
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the  likelihood  surface  will  tend  to  have  many  more  local  minima  when  there  are  more  dimensions, 
and  finding  a  good  initial  estimate  will  be  correspondingly  more  difficult. 

Another  important  issue  in  MDS  is  dimensionality,  i.e. ,  the  number  of  significant  attributes. 
Despite  its  importance  in  many  applications,  there  is  no  definitive  method  for  determining  dimen¬ 
sion  for  dissimilarity  data.  The  most  commonly  used  method  is  to  search  for  an  elbow ,  that  is  a 
point  where  a  measure  of  fit  or  a  measure  of  contribution  to  the  dissimilarity  levels  off,  in  a  plot  of 
the  measure  versus  dimension  (Spence  and  Graef,  1974;  Davison,  1983;  Borg  and  Groenen,  1997). 
However,  it  is  often  difficult  to  find  an  elbow  —  especially  when  there  are  significant  errors  —  and 
visual  inspection  of  a  plot  may  be  misleading  since  its  appearance  often  depends  on  the  relative 
scale  of  the  axes. 

In  this  paper,  we  deal  with  these  two  important  issues  in  MDS  within  a  Bayesian  framework. 
We  use  a  Euclidean  distance  model  and  assume  a  Gaussian  measurement  error  in  the  observed 
dissimilarity.  Under  the  model,  we  propose  a  simple  Markov  chain  Monte  Carlo  (MCMC)  algorithm 
to  obtain  a  Bayesian  solution  for  the  object  configuration.  We  found  that  the  proposed  method, 
which  we  call  Bayesian  MDS  (BMDS),  provided  a  much  better  fit  to  the  data  than  CMDS  in  all 
of  the  examples  we  tested.  Moreover,  the  improvement  in  performance  of  the  proposed  BMDS 
scheme  relative  to  CMDS  was  more  pronounced  when  there  were  significant  measurement  errors  in 
the  data  or  when  the  Euclidean  model  assumption  was  violated. 

Based  on  the  BMDS  estimate  of  object  configuration  over  a  range  of  dimensions,  we  propose 
a  simple  Bayesian  criterion  to  choose  an  appropriate  dimension.  This  criterion,  called  MDSIC,  is 
based  on  the  Bayes  factor,  or  ratio  of  integrated  likelihoods,  for  the  BMDS  estimated  configuration 
under  one  dimension  versus  a  different  dimension.  In  simulated  data,  we  found  that  the  criterion 
works  well  for  Euclidean  models  with  measurement  error  that  is  not  too  large.  In  real  examples, 
we  found  that  the  criterion  gave  satisfactory  results.  We  also  give  an  example  of  cluster  analysis  on 
real  dissimilarity  data  in  which  the  BMDS  estimates  of  object  configuration  are  used  in  conjunction 
with  model- based  clustering  (Banfield  and  Raftery,  1993;  Fraley  and  Raftery,  1998). 

In  our  approach,  observed  dissimilarities  are  modeled  as  equal  to  Euclidean  distances  plus  mea¬ 
surement  error.  In  this  sense,  what  we  do  here  can  be  viewed  as  a  Bayesian  analysis  of  metric  MDS, 
and  here  being  Bayesian  seems  to  confer  the  benefits  of  yielding  good  estimated  configurations,  and 
providing  a  formal  way  of  choosing  the  dimension.  A  great  deal  of  MDS  research,  however,  has 
focused  on  nonmetric  MDS,  in  which  the  relationship  between  dissimilarity  and  underlying  distance 
is  modeled  as  nonlinear.  One  could  use  the  basic  ideas  here  to  do  Bayesian  nonmetric  MDS,  and 
we  suggest  some  ways  of  doing  this  in  Section  6. 

The  rest  of  the  paper  is  organized  as  follows.  Classical  MDS  (Torgerson  1952,  1958)  is  briefly 


2 


reviewed  in  Section  2.  Bayesian  MDS  is  described  in  Section  3:  Section  3.1  presents  the  model 
and  the  prior,  Section  3.2  presents  an  MCMC  algorithm,  and  Section  3.3  describes  the  estimation 
of  object  configuration  from  the  MCMC  output.  Based  on  the  BMDS  output,  a  simple  Bayesian 
dimension  selection  criterion,  MDSIC,  is  described  in  Section  4.  Some  simulated  and  real  examples 
are  given  in  Section  5.  We  conclude  with  a  summary  and  discussion  in  Section  6. 


2  Classical  Multidimensional  Scaling 


In  this  section,  we  briefly  review  classical  MDS.  Let  Sij  denote  the  dissimilarity  measure  between 
objects  i  and  j,  which  are  functionally  related  to  p  unobserved  attributes  of  the  objects.  Let 
xi  =  ■■■■  xrp )  denote  an  unobserved  vector  representing  the  values  of  the  attributes  possesed  by 

object  i. 

Torgerson  (1952,  1958)  developed  a  technique  for  multidimensional  scaling,  now  called  classical 
MDS.  Assume  that  the  dissimilarity  measure,  6tJ.  is  the  distance  between  objects  i  and  j  in  a 
p-dimensional  Euclidean  space,  i.e., 


N 


^  ^ (xik 


k=l 


(i) 


where  is  the  k-th  element  of  x?.  The  elements  are  unknown,  and  the  goal  of  MDS  is  to 
recover  them  from  the  dissimilarity  data.  Because  of  the  non-identifiability  of  the  solution  under 
location  shift,  the  center  of  the  object  points  is  placed  at  the  origin,  so  that  xij  =  0  for 

j  =  1,  ...,p,  where  n  is  the  total  number  of  objects. 

Construct  a  double-centered  matrix  A  with  elements  defined  by 

“ij  =  ~  Sl  ~  S'2j  + 


where 


n  1  n  -inn 

j= 1  i= 1  z=l  j=l 


It  was  shown  by  Togerson  (1952,  1958)  that 


p 

aij  =  !>*■  xjk,  for  all  i,  j,  i.e.,  A  =  XX', 
k= l 


(2) 


where  X  is  the  n  x  p  matrix  of  object  coordinates.  The  coordinates  of  X  can  be  recovered  from 
the  spectral  decomposition  of  the  matrix  A  in  (2).  If  the  observed  dissimilarities,  dtJ.  satisfy  the 
Euclidean  distance  assumption  and  there  is  no  measurement  error,  then  the  Euclidean  distances 
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computed  from  the  matrix  X  satisfying  (2)  will  be  exactly  equal  to  the  given  dissimilarities.  How¬ 
ever,  when  the  model  assumption  is  violated  or  when  there  are  significant  measurement  errors  in 
the  data,  CMDS  estimates  of  object  configuration  may  not  be  very  useful. 

It  should  be  noted  that  the  matrix  X  satisfying  (2)  is  not  unique,  because  Euclidean  distance 
is  invariant  under  translation,  rotation,  and  reflection  about  the  origin.  However,  there  is  a  unique 
CMDS  solution  having  zero  mean,  diagonal  covariance,  and  some  fixed  coordinate  signs. 

At  present,  there  is  no  definitive  method  for  choosing  the  effective  dimension  of  x4,  the  number 
of  object  attributes  that  contribute  significantly  to  the  dissimilarities.  A  common  way  of  assessing 
dimension  is  to  look  at  the  eigenvalues  of  the  scalar  product  matrix  A.  The  k- th  eigenvalue  is 
a  measure  of  contribution  of  the  k- th  coordinate  of  X  to  squared  distances.  Small  eigenvalues 
(relative  to  the  largest  eigenvalue)  imply  that  the  corresponding  coordinates  make  little  contribu¬ 
tion  to  the  squared  distances  and  hence  only  the  first  p  coordinates  of  X  corresponding  to  the 
first  p  significantly  large  eigenvalues  suffice  to  represent  the  objects.  To  determine  significantly 
large  eigenvalues,  one  may  draw  a  plot  of  the  ordered  eigenvalues  versus  dimension  and  look  for  a 
dimension  at  which  the  sequence  of  eigenvalues  levels  off.  If  each  StJ  is  equal  to  a  p-dimensional 
Euclidean  distance  between  objects  i  and  j  as  given  in  (1),  then  the  plot  should  level  off  precisely 
at  dimension  (p  +  1). 

A  measure  of  fit,  called  stress ,  is  also  commonly  used  to  determine  the  dimensionality.  Several 
definitions  of  stress  have  been  proposed;  the  one  we  use  here,  and  perhaps  the  mostly  commonly 
used  one,  is 


STRESS 


\ 
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where  Sij  is  the  Euclidean  distance  obtained  from  the  estimated  object  configuration  (Kruskal 
1964).  A  plot  of  STRESS  versus  dimension  will  level  off  at  the  true  dimension  p ,  if  (1,-j  —  Sij  and 
6ij  is  given  by  (1). 

Both  methods  rely  on  detecting  an  elbow  in  a  sequence  of  values,  that  is,  a  point  where  the 
plot  levels  off.  However,  in  real  data  that  do  not  conform  exactly  to  the  model  or  in  which  there 
is  a  significant  amount  of  measurement  or  sampling  error,  an  elbow  may  be  difficult  to  discern.  In 
addition,  visual  detection  of  an  elbow  in  a  plot  can  be  misleading  since  the  outcome  may  depend 
on  the  scale  of  the  axes. 
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3  Bayesian  Multidimensional  Scaling 

3.1  Model  and  Prior 

Dissimilarity  data  can  be  obtained  in  various  forms.  However,  since  Euclidean  distance  is  easy  to 
handle  and  is  relatively  insensitive  to  the  choice  of  dimension  compared  to  other  distance  measures, 
it  tends  to  be  used  in  cases  in  which  the  dimension  is  unknown  unless  there  is  strong  theoretical 
evidence  for  preferring  a  non-Euclidean  distance  (Davison,  1983).  Thus,  for  Bayesian  MDS  we 
model  the  true  dissimilarity  measure  StJ  as  the  distance  between  objects  i  and  j  in  a  Euclidean 
space,  as  given  in  (1). 

In  practical  situations  there  are  often  measurement  errors  in  observations.  In  addition,  dis¬ 
similarity  measures  are  typically  given  as  positive  values.  We  therefore  assume  that  the  observed 
dissimilarity  measure  dij  comprising  the  data  is  equal  to  the  true  measure  Sij  plus  a  Gaussian  error, 
with  the  restriction  that  the  observed  dissimilarity  measure  is  always  positive.  In  other  words,  given 
Sij,  the  observed  dissimilarity  measure  dVJ  is  assumed  to  follow  the  truncated  normal  distribution 

dij  ~  N(Sij,a  )  I  (d^  >  0),  i  ^  j,i,j  =  l,...,n,  (3) 

where  StJ  =  \Jj2Pk-i(xik  ~  xjk)2i  and  the  xtj,.  are  unobserved.  From  this,  the  likelihood  function  of 
the  unknown  parameters  X  =  { x, }  and  a2  is 

l(x,cr2)  oc  (a2ym/2  exp  “7^2  SSR  ~  12 1°§  ^  (' ’  (4) 

i>j  '  '  _ 

where  m  =  n(n  —  1)  /2  is  the  number  of  dissimilarities,  SSR  =  J2i>j  ( dij  Sij)2  is  the  sum  of  squared 
residuals,  and  <&(•)  is  the  standard  normal  cdf. 

For  Bayesian  analysis  of  the  model,  we  need  to  specify  priors  for  X  =  { x,; } .  and  a2.  For  the 
prior  distribution  of  x.,;.  we  use  a  multivariate  normal  distribution  with  mean  0  and  a  diagonal 
covariance  matrix  A,  i.e., 

Xj  ~  IV(0,  A), 

independently  for  i  —  1, ..,  n.  For  the  prior  of  the  error  variance  a2,  we  use  a  conjugate  prior 

<r2  ~  IG(a,b ), 

the  inverse  Gamma  distribution  with  mode  b/(a  +  1).  For  a  hyper- prior  for  the  elements  of  A  = 
Diag(Ai, ...,  A p),  given  dimension  p ,  we  also  assume  a  conjugate  prior, 

A j~IG(a,Pj), 
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independently  for  j  =  1,  ,.,p.  We  assume  prior  independence  among  X,  A,  and  a 2,  i.e.,  7 r(X,  cr2,  A)  = 
7r(X)  7r(cr2)  7t(A),  where  7r(X),  7r(cr2),  and  7t(A)  are  the  priors  given  above. 

When  there  is  little  prior  information,  one  may  use  either  the  results  from  a  preliminary  run 
or  the  results  from  CMDS  for  parameter  specification  in  the  priors.  For  instance,  one  may  choose 
a  small  a  for  a  vague  prior  of  cr2  and  choose  b  so  that  the  prior  mean  matches  with  the  estimated 
mean  of  a2  from  CMDS.  Similarly,  for  the  hyper-prior  of  A  j,  one  may  choose  a  small  a  and  choose 
/ 3j  so  that  the  prior  mean  of  A j  matches  with  the  j-th  diagonal  element  of  the  covariance  matrix 
Sx  =  ^  J2i=i  xixi  obtained  from  CMDS.  As  noted  above,  the  CMDS  solution  can  be  transformed 
so  as  to  have  zero  mean  and  diagonal  covariance. 

Such  a  prior  is  mildly  data-dependent,  and  it  might  be  argued  that  this  violates  the  definition 
of  a  prior  distribution.  However,  we  view  this  prior  as  an  approximation  to  the  elicited  data- 
independent  prior  of  an  analyst  who  knows  a  little,  but  not  much,  about  the  problem  at  hand. 
Because  this  prior  is  diffuse  relative  to  the  likelihood,  the  estimation  results  are  unlikely  to  be 
highly  sensitive  to  its  precise  specification. 

3.2  Markov  Chain  Monte  Carlo 


From  the  likelihood  and  the  prior,  the  posterior  density  function  of  the  unknown  parameters 
(X,  cr2,  A)  is 


tt(X,ct2,A|D)  oc  (ct2)-(W2+«+i) 

p  ["1 

x  n Ajn/2  exp  ~^2ssr - ^iog  ^ 

j= 1  i>j 


1  n 


E 


3= 1 


(5) 


where  D  is  the  matrix  of  observed  dissimilarities.  Because  of  the  complicated  form  of  the  posterior 
density  function  (5),  numerical  integration  is  required  to  obtain  a  Bayes  estimate  of  the  parameters. 
In  particular,  the  posterior  is  a  complicated  function  of  the  Xj’s,  which  in  most  cases  are  of  high 
dimension.  We  therefore  use  a  Markov  chain  Monte  Carlo  (MCMC)  algorithm  (e.g.  Gilks  et  al. 
1996)  to  simulate  from  the  posterior  distribution  (5).  Our  algorithm  proceeds  by  iteratively  gen¬ 
erating  new  values  for  each  object  configuration  x.,;.  the  error  variance  cr2,  and  the  hyperparameter 
A,  given  the  current  values  of  the  other  unknowns. 

We  first  suggest  initialization  strategies  for  the  unknown  parameters  which  are  needed  for  the 
MCMC  algorithm.  For  initialization  of  Xj,  one  may  use  the  output,  x[0^ ,  of  Xj  from  CMDS  since 
it  is  easy  to  obtain.  The  resulting  X  can  then  be  centered  at  the  origin  and  then  transformed 
using  the  spectral  decomposition  so  as  to  have  a  diagonal  covariance  matrix,  thus  conforming  to 
the  prior.  From  the  adjusted  x-0^  's,  one  can  compute  the  sum  of  squared  residuals  SSRI{)>  and 
an  estimate  a1 0)2  =  SSR /m  which  can  be  used  as  an  initial  value  of  cr2  in  the  algorithm.  In 
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addition,  diagonal  elements  of  the  adjusted  sample  covariance  matrix  of  X  can  be  used  as  initial 
values  for  the  A? 's. 

We  now  describe  the  details  of  sample  generation  in  the  MCMC  algorithm.  At  each  iteration, 
we  simulate  a  new  value  of  A j  from  its  conditional  posterior  distribution  given  the  other  unknowns. 
From  (5),  the  full  conditional  posterior  distribution  of  A j  is  the  inverse  Gamma  distribution  IG(a  + 
n/ 2,  j3j  +  Sjj 2),  where  Sj/n  is  the  sample  variance  of  the  j-th  coordinates  of  Xj’s.  We  use  a  random 
walk  Metropolis-Hastings  step  (Hastings,  1970)  to  generate  x,  and  a2  in  each  iteration  of  the 
MCMC  algorithm. 

Generation  of  x, 

A  normal  proposal  density  is  used  in  the  random  walk  Metropolis-Hastings  algorithm  for  gener¬ 
ation  of  Xj .  To  choose  the  variance  of  the  normal  proposal  density,  we  note  that  the  full  conditional 
posterior  density  of  x,  is 


7r(xj|else)  oc  exp 


1  n  /  r 

2  i  V 


where 


Since 


dij)  ;  Q'2  —  xjA  xj- 


j&j= i 


„2  dij )  — 

(xj  -  Xj-J^Xj  -  Xj)  -  2 dijsJ(xi  -Xj)'(xj  -Xj)  +  d\ 


G 

1 


is  a  quadratic  function  of  x4  with  leading  coefficient  equal  to  1  la2  and  Qi  has  n  —  1  of  this  kind 
while  Q2  has  only  one  quadratic  term  with  coefficient  A,  Q 1  would  dominate  the  full  conditional 
posterior  density  function  of  x,;  unless  there  is  strong  prior  information.  Thus,  we  may  consider  Q\ 
only  and  approximate  the  full  conditional  variance  of  x,  by  a2 /(n  —  1)  and  choose  the  variance  of 
the  normal  proposal  density  to  be  a  constant  multiple  of  a2 /{n  —  1).  With  this  proposal  density, 
the  random  walk  M-H  algorithm  can  be  summarized  as: 


•  Generate  x-v  from  N 
constant. 


°2  ) 

(n-l)J 


,  where  xp 


is  the  current  sample  of  x,  and  A;  is  a 


•  Replace  xp  by  xp  with  probability  min{l,  7r(xP|else)/7r(xp|else)}. 


Generation  of  a2 
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From  a  preliminary  numerical  study,  we  found  that 


7r(cr2 1 else)  oc  (<r2)  (m/2+a+1)  6xp 


_J_(5Si?/2  +  &)-£log 

(j  —  \  a  / 

i>j 


is  well  approximated  by  the  density  function  of  IG(m/ 2  +  a^SSR/2  +  6),  up  to  a  constant  of 
proportionality.  When  the  number  of  dissimilarities,  m  =  n(n  —  l)/2,  is  large,  which  is  often  the 
case  since  m  is  a  quadratic  function  of  n.  the  inverse  Gamma  density  function  is  well  approximated 
by  a  normal  density.  Thus,  we  propose  a  random  walk  M-H  algorithm  with  a  normal  proposal 
density  with  variance  proportional  to  the  variance  y2  of  IG{vnj 2  +  a.  SSR/2  +  b )  distribution, 
namely: 


•  Generate  ct2jV  from  N(a2C ,  k  *  y2 ) ,  where  a2C  is  the  current  sample  of  a2. 

•  Replace  u2C  by  a2N  with  probability  min{l,  Tt(a2N\ else) /tt{u2C\ else)}. 

3.3  Posterior  Inference 

Iterative  generation  of  {  x., ,  i  =  1, ..,  n}.  er2,  and  X3 ,  j  =  1.  ...p.  for  a  sufficiently  long  time  provides 
a  sample  from  the  posterior  distribution  of  the  unknown  parameters,  and  Bayes  estimation  of  the 
parameters  can  be  obtained  from  the  samples.  However,  because  the  model  assumes  a  Euclidean 
distance  for  the  dissimilarity  measure  5^ ,  the  posterior  samples  of  { x,; }  would  be  invariant  under 
translation,  rotation  and  reflection  about  the  origin,  as  in  classical  MDS,  unless  there  is  strong 
prior  information  to  the  contrary.  We  can  retrieve  only  the  relative  locations  of  the  objects  from 
the  data,  and  not  their  absolute  locations.  Hence  the  convergence  of  Sij  rather  than  X  needs  to  be 
checked  to  verify  the  convergence  of  MCMC.  The  near  lack  of  identifiability  in  X  also  suggests  the 
use  of  sample  averages  as  Bayes  estimates  to  be  inadvisable,  since  the  MCMC  samples  of  X  may  be 
unstable  even  when  the  distances  S^  are  stable.  Thus,  we  take  an  approximate  posterior  mode  of  X 
as  a  Bayes  estimate  of  X,  i.e. ,  the  BMDS  solution  of  the  object  configuration.  The  posterior  mode 
provides  relative  positions  of  Xj’s  corresponding  to  the  maximum  posterior  density.  A  meaningful 
absolute  position  of  X  may  be  obtained  from  an  appropriate  transformation,  if  desired. 

To  obtain  the  posterior  mode  of  X,  one  may  compute  the  posterior  kernel  for  each  MCMC 
sample.  However,  this  can  be  time  consuming,  since  the  posterior  is  complicated.  However,  we 
observed  that  the  likelihood  dominates  the  prior  and  in  the  likelihood  (4)  the  term  involving  SSR  is 
dominant,  so  that  the  posterior  mode  of  X  can  be  approximated  by  the  value  of  X  which  minimizes 
the  sum  of  squared  residuals  SSR. 
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Since  the  center  and  direction  of  X  can  be  arbitrary,  we  post-process  the  MCMC  sample  of  X 
at  each  iteration  using  the  transformation 

xi  =  Dx(*i  ~  x); 

where  x  is  the  average  of  x*’s  and  Dx  is  the  matrix  whose  columns  are  the  eigenvectors  of  the 
sample  covariance  matrix  Sx  =  ^  Sr=i(x*  —  x)'(xj  —  x)  of  xys.  This  transformation  does  not  solve 
the  non-indentifiability  problem  but  the  new  x,  ’s  have  mean  0  and  a  diagonal  covariance  matrix  to 
correspond  to  the  prior  specification. 

4  Dimensionality 

BMDS  as  described  in  the  previous  section  gives  object  configurations  in  a  given  dimensional 
Euclidean  space.  In  most  cases  the  dimension  of  the  objects  (the  number  of  significant  attributes) 
is  unknown.  In  this  section,  we  propose  a  simple  Bayesian  dimension  selection  criterion  based  on 
the  BMDS  object  configurations. 

Consider  the  dimension  p  as  an  unknown  variable  and  assume  equal  priors  for  all  values  of  p. 
Then  the  posterior  is  given  as 

7r(X,  a2,  A,p\D)  oc  /(X,  cr2,p|D)7r(X|A,p)7r(cr2)7r(A|p) 

=  (2 *)-ml2o-mexp\~SSR  -  £(<*,< Hki/a)]  ■  (2^l2  f[  A £  A-.j] 

i>j  j=  1  j= 1  3 

P 

x  T {a)-1^ (o2)-(a+1'1  exp[— b/a2]  •  T(a)~p  A  ^  - Q + 1  ^ exp [ — dj / Xj ] 

3=1 

=  A-h(a2,X)-g(  A,X), 

where 


n 


S3 

=  E4’ 

i— 1 

(6) 

A 

P 

=  (2vr)-(m+np)/2  T(a)-16a  T(a)-p  J] 

3=1 

(7) 

h(a2,  X) 

=  (<J2)-(m/2+a+l)  gxp  [.(55^/2  +  &)/CT2]  , 

(8) 

g(  a,x) 

=  f[  atC"/2+°+D  exp  [— (s,/2  +  . 

3=1 

(9) 

Note  that,  because  of  the  post-processing  described  in  Section  3.3,  the  Xj’s  have  mean  0  and  a 
diagonal  covariance  matrix. 
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We  adopt  a  Bayesian  approach  to  choosing  the  dimension.  We  view  the  overall  task  to  be  that  of 
choosing  the  best  configuration,  and  hence  we  view  the  choice  between  dimension  p  and  dimension 
p'  as  being  between  the  estimated  configuration  with  dimension  p  and  the  estimated  configuration 
with  dimension  p' .  Thus,  we  consider  the  marginal  posterior,  7r(X,p|D),  of  (X,p)  with  X  equal  to 
the  BMDS  solution  and  choose  the  value  of  p  which  gives  the  largest  value  of  7r(X,p|.D).  When 
choosing  between  dimension  p  and  dimension  p' .  this  is  based  on  the  posterior  odds  for  the  estimated 
configuration  of  dimension  p  as  against  the  estimated  configuration  of  dimension  p' . 

Now,  to  compute  the  criterion,  note  that 

ir(X.,p\D)  oc  J /(X,  a2 1p\D)ir(a2)da2  ■  j  7r(X,A,p)dA 
*  Z(X,p|£>)7r(X,p), 

where  Z(X,p|T>)is  the  marginal  likelihood  of  (X,p)  and  7r(X,p)  is  the  marginal  prior  of  (X,p). 
The  marginalised  likelihood  term  would  increase  as  p  increases.  However,  the  marginal  prior  term 
decreases  as  p  increases,  since  we  are  using  a  diffuse  (but  proper)  prior,  and  so  this  term  penalizes 
more  complex  models.  The  approach  has  the  simplicity  of  a  maximum  likelihood  method,  as  well 
as  the  advantage  of  a  Bayesian  method  in  penalizing  more  complex  models. 

Integrating  the  function  g( A,  X)  given  in  (9)  with  respect  to  A  gives 

[  g(A,  X)dA  =  Tp(n/2  +  a)  f[  (Sj/2  +  ^/2^ .  (10) 

J  3=  1 

The  integral  of  the  function  h(a2.  X)  given  in  (8)  with  respect  to  a2  is  approximately  equal  to 

J  h(a2,X)da2  »  (2TT)1/2  (m/2)-1/2  (SSR/m)-m/2+1  exp[— m/2],  (11) 

This  formula  is  justified  in  the  Appendix.  From  these, 

tt(XiP\D)  oc  A  ■  J  h(a2)da2  ■  J  g(A,X)dA 

p 

=  A*  ■  ( SSR/m)-m/2+1  •  J] (sj/2  +  /3j)-^2+a\  (12) 

3=  i 

where 

A*  =  A  ■  (2i r)1/2  Tp(n/2  +  a)  (m/2)-1/2  exp[— m/2]. 

To  clarify  the  dependence  of  X  on  p ,  let  X*T  denote  the  BMDS  solution  of  X  when  the  dimension 
is p.  There  is  a  difficulty  in  directly  comparing  (XiRi.p)  and  (Xlp+li .p  +  1).  The  marginal  posterior 
7r(X,p|D)  is  dependent  on  the  scale  of  X,  because  it  includes  the  term  nf=i(sj/2  +  Pj)  (-n/2+a'1 . 
Note  that  Sj/n  is  the  sample  variance  of  the  j-th  coordinate  of  X.  However,  without  improvement 
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in  the  fit,  the  scale  of  X  may  change  with  the  dimension  p.  Given  the  same  Euclidean  distances, 
the  coordinates  of  X  would  get  closer  to  the  origin  as  p  increases,  unless  all  the  extra  coordinates 
are  equal  to  0.  For  instance,  the  Euclidean  distance  between  —1  and  1  in  one-dimensional  space 
is  equal  to  the  Euclidean  distance  between  ( 1  / x/2, 1/V2)  and  ( — 1  / x/2 ,  — 1/ \/2)  in  two-dimensional 
space,  and  hence  the  variance  in  each  coordinate  is  smaller  in  two-dimensional  space.  This  would 
give  a  smaller  Sj  and  hence  a  larger  Tr(X.p\D)  in  a  higher  dimension,  even  though  there  is  no 
change  in  the  distance  and  the  fit. 

To  circumvent  this  scale  dependency,  a  dimension  selection  criterion  should  compare  X’s  in  the 
same  dimension.  For  this,  let  X*(p+1)  =  (X*-p)  :  0)  in  (p+  1) -dimensional  space,  which  has  the  first 
p  coordinates  equal  to  Xip>  and  the  last  coordinates  all  equal  to  0.  Then  X**p  l  1 ;  provides  the  same 
Euclidean  distances  and  the  same  fit  as  Xlpi  and  may  be  considered  an  implantation  of  XiR;  in 
(p  +  l)-dimensional  space.  Ideally,  if  p  is  the  correct  dimension,  then  the  optimal  solution  Xip  1  ri 
in  ( p  +  1) -dimensional  space  would  be  X*(p+1).  Thus,  we  compare  X*(p+1)  and  X-p+  ri  and  choose 
p  to  be  the  dimension  if  1 1  has  a  larger  marginal  posterior  density  than  X!-/J+ 1 1 . 

From  (12),  the  ratio  of  the  marginal  posteriors  of  X^p+B  and  X*-p+1)  is 


Rp  = 


tt{X^+1\P  +  1\D) 
tt{X<p+1\p  +  1\D) 


(  S  SR. 


■p+ 1 


\ssr;+ , 


(SSRp+1' 
\  SSRP 


V  2+l  /p+ 1 


-m/2+1 


(n/2+a) 


P  Jp+B 


-(n/2+a) 


S(P+B 

Sp+ 1 


/2  +  f3j 


-(n/2+a) 


=i  4P)/2  +  /3, 


Pi 


where  sjp'>  is  Sj  given  in  (6),  computed  from  X<PK  Clearly,  the  ratio  Rp  depends  on  the  choice  of 
the  hyper-parameters  a  and  (3j  of  A. 

When  there  is  no  strong  prior  information,  a  reasonable  choice  for  a,/3j  in  (p  +  1) -dimensional 
space  might  bea  =  j  and  f3j  =  ^  sj  1  1 1  /n  so  that  the  prior  information  roughly  corresponds  to  the 
information  from  one  observation.  This  is  close  to  the  unit  information  prior,  which  was  observed  by 
Kass  and  Wasserman  (1995)  to  correspond  to  the  BIC  approximation  to  the  Bayes  factor  (Schwarz 
1978),  and  by  Raftery  (1995)  to  correspond  to  a  similar  approximation  to  the  integrated  likelihood. 
Raftery  (1999)  argued  that  this  is  a  reasonable  proper  prior  for  approximating  the  situation  where 
the  amount  of  prior  information  is  small. 

This  yields  the  ratio 


Rp  = 


(  SSR 


\  SSR. 


p+i 


-mj  2+1 


„(p+b 


(n+  !) 


-(n+ 1)/2 


“l  (n  +  r<f+1)) 


("  +  1)' 


-{n+ 1)/2 
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where  r{p+1)  =  s^/sf . 


Taking  minus  twice  the  log  of  the  ratio  gives 


LRp  =  —2  log  Rp 

=  (m  —  2)  log(SSRp+i/SSRp) 


+ 


p 

(n  +  i)  ^iog 
3=1 


„(p+!) 


(n  +  !) 


(n  + 


+  {n  +  1)  log(n  +  1) 


(13) 

(14) 


Note  that  the  term  (13)  in  LRp  is  roughly  the  log  likelihood  ratio,  and  would  be  negative  since 
higher  dimension  results  in  a  smaller  SSR.  The  term  (14)  plays  the  role  of  penalty  on  the  increase 
of  dimension  by  one  and  would  be  positive  if  <  1  and  IIy= i  >  V(n  +  !)■  When  there 

is  no  significant  change  in  X  between  p-  and  (p  +  1) -dimensional  spaces,  then  rjP 1  1  1  ~  1  and  the 
penalty  term  is  approximately  (n  +  1)  log  (n  +  1). 

A  positive  LRp  would  prefer  the  dimension  p  to  (p  +  1)  and  a  negative  value  would  prefer  the 
dimension  (p+ 1)  to  p  and  hence  one  can  select  the  dimension  where  the  value  of  LRp  turns  positive. 
Alternatively,  if  we  define  MDSIC  as 


MDSICi  =  (m  -  2 )log  SSRx 

v- 1 

MDSICp  =  MDSICi  +  Y.LR3 

3= 1 

then  the  optimal  dimension  is  the  one  which  achieves  the  minimum  of  MDSICp. 


(15) 

(16) 


5  Examples 

BMDS  requires  that  prior  parameters  be  specified.  For  all  the  examples  given  in  this  section,  we 
chose  5  degrees  of  freedom  a  for  the  prior  of  a2  and  chose  b  to  match  the  prior  mean  of  a2  with 
the  estimate  obtained  from  the  CMDS.  Note  that  a  smaller  a  would  not  make  much  difference 
since  m  =  n(n  —  l)/2  is  large.  For  the  hyper-prior  of  A j,  we  choose  a  =  1/2  and  j3j  =  /n, 

where  /n  is  the  estimated  variance  of  the  j-th  coordinate  of  X  obtained  from  the  CMDS,  which 
roughly  corresponds  to  information  from  one  observation  as  described  in  Section  4. 

For  the  constant  k  in  the  Metropolis-Hastings  algorithms  for  generating  x,;  and  cr2,  we  chose 
k  =  2.382  for  both  Xj  and  cr2  as  suggested  by  Gelrnan  et  al.  (1996).  We  found  reasonably  fast 
mixing  in  MCMC  with  this  choice  of  k. 

5.1  A  Simulation 

As  an  illustrative  example,  we  generated  50  random  samples  of  x,  from  a  10-dimensional  multivari¬ 
ate  normal  distribution  with  mean  0  and  variance  I,  the  identity  matrix.  We  used  the  Euclidean 


12 


dim 

CMDS 

STRESS 

BMDS 

STRESS 

LRT 

Penalty 

MDSIC 

1 

0.6622 

0.4864 

-1118.7 

177.0 

10673 

2 

0.4943 

0.3078 

-830.4 

170.7 

9731 

3 

0.3720 

0.2192 

-693.0 

165.3 

9071 

4 

0.2751 

0.1651 

-638.1 

164.2 

8544 

5 

0.2037 

0.1272 

-499.2 

174.2 

8070 

6 

0.1580 

0.1037 

-535.5 

171.4 

7745 

7 

0.1092 

0.0833 

-334.0 

178.6 

7381 

8 

0.0809 

0.0727 

-237.6 

177.8 

7225 

9 

0.0672 

0.0660 

-195.2 

182.2 

7165 

10 

0.0614 

0.0609 

-23.6 

196.8 

7152* 

11 

0.0658 

0.0603 

12.3 

203.4 

7326 

12 

0.0715 

0.0606 

-22.8 

195.0 

7541 

13 

0.0784 

0.0601 

2.5 

232.7 

7713 

14 

0.0855 

0.0601 

-29.8 

170.5 

7949 

Table  1:  Analysis  of  the  simulation  data  in  Example  1,  Xj  ~  iVio(0,I).  The  minimum  MDSIC  is 
marked  with  a  star. 

distances  between  pairs  of  Xj,Xj  as  dissimilarities  5ij.  Given  these  <%’s,  we  generated  the  observed 
distances  (1,-j  from  a  normal  distribution  with  mean  6tJ  and  standard  deviation  0.3,  truncated  at  0. 
Thus,  the  data  consist  of  a  50  x  50  symmetric  matrix  of  dissimilarities  computed  from  Euclidean 
distances  with  Gaussian  errors. 

Using  the  results  from  CMDS  for  initialization,  BMDS  as  described  in  Section  3  was  applied 
for  various  values  of  the  dimension  p.  Time  sequence  plots  from  samples  of  S^s  and  a2  in  MCMC 
converged  quickly,  similarly  to  Figure  2.  We  took  samples  from  10000  consecutive  iterations  after 
3000  burn-in  iterations  from  MCMC.  With  minimum  SSR  and  x,;  obtained  from  the  BMDS,  we 
applied  MDSIC  described  in  Section  4  to  select  the  dimension  of  x?.  The  results  are  summarized 
in  Table  1. 

The  first  and  the  second  columns  show  values  of  STRESS  from  CMDS  and  BMDS,  respectively. 
The  third  and  the  fourth  columns  show  the  likelihood  ratio  term  of  (13)  and  the  penalty  term 
of  (14),  respectively.  The  last  column  shows  the  MDSIC  given  in  (16).  It  can  be  observed  that 
BMDS  gives  a  better  fit  than  CMDS,  providing  a  smaller  STRESS,  especially  when  the  dimension 
is  incorrect.  This  is  interesting  because,  for  visualization  purposes,  dimension  p  =  2  is  often  chosen. 
In  this  case,  the  STRESS  from  CMDS  for  dimension  2  is  60%  greater  than  for  BMDS. 

It  is  interesting  to  note  that,  for  CMDS,  STRESS  increases  after  dimension  10  while  for  BMDS, 
STRESS  stays  roughly  constant  after  dimension  10.  Ideally,  since  the  true  dimension  is  10,  all 
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NY 

22 

36 

48 

43 

26 

51 

24 

73 

100 

68 

Azores 

AZ 

PY 

54 

33 

59 

33 

31 

37 

93 

88 

84 

Bagdad 

39 

BD 

PS 

57 

7 

56 

72 

50 

57 

105 

61 

Berlin 

22 

20 

BN 

RO 

57 

66 

18 

69 

113 

84 

115 

Bombay 

59 

20 

39 

BY 

RE 

63 

74 

57 

57 

101 

61 

BuenosAires 

54 

81 

74 

93 

BS 

SF 
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Figure  1:  Airline  distances  between  cities  (100  miles)  (The  World  Almanac,  1966). 

coordinates  except  for  the  first  10  should  be  roughly  zero  and  SSR  should  be  about  the  same 
after  dimension  10.  This  is  precisely  what  ocurred  in  BMDS,  indicating  that  BMDS  handles  the 
measurement  errors  well. 

The  third  and  the  fourth  columns  show  that  the  log  likelihood  ratios  computed  from  the  BMDS 
solution  for  X  decrease  monotonically  as  p  increases  upto  10,  that  there  is  no  significant  change 
after  dimension  10,  and  that  the  penalties  stay  about  the  same  for  various  p.  Moreover,  the  MDSIC 
assumes  a  minimum  at  the  correct  dimension,  namely  10. 

5.2  Airline  Distances  between  Cities 

The  World  Almanac  (1966,  p.510)  provided  airline  distances  between  30  principal  cities  of  the 
world,  which  are  shown  in  Figure  1.  Cities  are  located  on  the  surface  of  the  earth,  a  3-dimensional 
sphere,  and  airplanes  travel  on  the  surface  of  the  earth.  Thus,  airline  distances  are  not  exactly 
Euclidean  distances  and  we  may  expect  the  dimension  of  Xj  to  be  between  2  and  3. 

The  BMDS  is  applied  to  the  data.  Figure  2  shows  time  sequence  plots  of  SSR  and  some  StJ‘s 
from  10000  MCMC  iterations  after  a  burn-in  of  3000.  The  plots  suggest  that  convergence  has  been 
achieved.  Here  we  are  interested  in  finding  an  approximate  posterior  mode  of  X  rather  than  its 
full  posterior  distribution,  so  that  convergence  requirements  are  less  stringent  than  if  one  seeks  the 
full  posterior  distribution.  The  BMDS  results  are  shown  in  Table  2. 

Bayesian  MDS  yielded  much  smaller  SSR  than  classical  MDS  in  all  cases.  The  estimated  SSR 
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dim 

CMDS 

STRESS 

BMDS 

STRESS 

LRT 

Penalty 

MDSIC 

1 

0.6782 

0.3617 

-704.2 

95.7 

5336 

2 

0.4682 

0.1604 

-548.5 

91.4 

4727 

3 

0.3811 

0.0851 

4.7 

108.0 

4270* 

4 

0.4006 

0.0856 

-2.4 

88.9 

4383 

5 

0.4139 

0.0854 

4.1 

143.9 

4469 

Table  2:  Analysis  of  the  City  Data.  The  minimum  MDSIC  is  marked  with  a  star. 


dropped  very  quickly  until  dimension  3  and  then  increased  slightly  at  dimension  4.  MDSIC  selected 
dimension  3.  We  observed  that  the  last  coordinates  of  x*  in  dimension  4  are  almost  equal  to  0, 
indicating  strong  evidence  for  dimension  3. 


Figure  3  is  a  plot  of  the  observed  airline  distances  versus  the  estimated  Euclidean  distances. 
A  perfect  fit  would  yield  a  forty-five  degree  line  as  shown  in  Figure  3.  The  estimated  Euclidean 
distances  from  BMDS  are  represented  as  red  dots  and  and  those  from  CMDS  as  green  dots.  One 
can  see  that  BMDS  provided  points  very  close  to  the  forty- five  degree  line  except  for  the  points 
corresponding  to  large  distances.  The  fit  gets  worse  as  the  distance  gets  larger,  because  when 
cities  are  farther  apart,  there  is  a  greater  discrepancy  between  airline  distance  and  3-dimensional 
Euclidean  distance. 


Figure  4  shows  plots  for  the  locations  of  cities,  obtained  from  BMDS  with  dimension  p  =  3, 
and  rotated  manually  to  approximately  fit  the  true  location  of  the  cities.  One  can  observe  that  the 
cities  are  located  approximately  on  the  surface  of  a  sphere  with  the  radius  of  the  earth  and  that 
the  locations  of  the  cities  are  well  recovered. 

5.3  Careers  of  Lloyds  Bank  Employees,  1905—1950 

Sociologists  are  interested  in  characterizing  and  describing  careers,  to  answer  questions  such  as: 
What  are  the  typical  career  patterns  in  a  given  period  in  a  particular  society?  Have  they  been 
changing  over  time?  Have  people  become  more  mobile  occupationally? 

One  approach  to  doing  this  views  a  career  as  a  sequence  of  occupations  held,  for  example, 
in  successive  years,  and  then  seeks  to  measure  the  similarity  or  dissimilarity  between  different 
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16 


Figure  3:  Observed  and  estimated  distances  for  the  Airline  distance  data  (in  units  of  100  miles). 
The  estimated  distances  from  BMDS  are  represented  by  red  dots  and  those  from  CMDS  by  green 
dots. 
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careers.  Abbott  and  Hrycak  (1990)  proposed  measuring  the  dissimilarity  between  the  careers  of 
two  individuals  by  counting  the  minimum  number  of  insertions,  deletions  and  replacements  that 
would  be  necessary  to  transform  one  career  into  another.  Costs  are  associated  with  each  kind 
of  change,  and  the  dissimilarity  between  the  two  careers  is  then  measured  as  the  total  cost  of 
transforming  one  career  into  another.  This  approach,  known  as  optimal  alignment,  is  borrowed 
from  molecular  microbiology,  where  it  is  applied  to  the  comparison  of  DNA  and  protein  sequences 
(Sankoff  and  Kruskal  1983;  Boguski  1992). 

Here  we  reanalyze  some  data  considered  by  Stovel  et  al  (1996),  consisting  of  the  careers  of 
80  randomly  selected  employees  of  Lloyds  Bank  in  England,  whose  careers  started  between  1905 
and  1909.  This  is  part  of  a  much  larger  study  aimed  at  discovering  how  career  patterns  in  large 
organizations  have  evolved  over  the  course  of  the  twentieth  century.  The  more  immediate  goal  here 
is  to  discover  what  the  typical  career  sequences  are,  for  data  reduction  and  exploratory  purposes, 
and  also  as  a  basis  for  further  analysis.  For  each  employee,  information  about  his  work  position 
is  available  for  each  year  he  was  at  Lloyds.  The  information  consists  of  the  nature  of  the  position 
(four  categories,  from  clerk  to  senior  manager),  and  the  kind  of  place  they  were  in  (six  categories, 
from  small  rural  place  to  large  city) . 

From  the  sequence  data,  an  80  x  80  matrix  of  dissimilarity  measures  was  obtained,  using  the 
method  of  Abbott  and  Hrycak  (1990);  for  more  details,  see  Stovel  et  al  (1996).  Clearly  the  dissim¬ 
ilarities  are  not  Euclidean  distances,  and  may  not  satisfy  certain  geometric  properties  that  hold  for 
Euclidean  distances,  such  as  the  triangle  inequality.  Our  approach  is  to  model  the  dissimilarities  as 
before,  with  the  idea  that  the  non-Euclidean  nature  of  the  dissimilarities  can  be  modeled  at  least 
approximately  as  part  of  the  error.  As  we  will  see,  this  supposition  turns  out  to  be  reasonable  in 
practice. 

We  applied  BMDS  to  the  dissimilarity  data.  Table  3  presents  the  results  of  the  analysis  together 
with  STRESS  from  CMDS.  Again,  BMDS  performed  much  better  than  CMDS  especially  when  the 
dimension  is  too  small  or  too  large.  The  improvement  in  performance  of  BMDS  is  more  pronounced 
in  this  example  than  in  the  two  previous  examples.  This  suggests  that  BMDS  is  more  robust  than 
CMDS  to  variations  in  the  alleged  dimension  and  to  violations  of  the  Euclidean  model  assumption. 

Dimension  8  is  chosen  as  optimal  since  MDSIC  attains  its  minimum  at  8.  Thus,  the  estimated 
configuration  X  when  p  =  8  can  be  used  as  a  final  estimate  of  X.  Figure  5  shows  the  fitted 
and  observed  dissimilarities  for  both  BMDS  and  CMDS.  The  BMDS  fitted  dissimilarities  fit  the 
observed  ones  very  well,  considerably  better  than  the  CMDS  ones;  the  sum  of  squared  residuals  for 
BMDS  is  less  than  half  that  for  CMDS. 

Figure  6  gives  pairwise  scatter  plots  of  the  first  four  dimensions  of  the  BMDS  estimates  of  X. 
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dim 

CMDS 

STRESS 

BMDS 

STRESS 

LRT 

Penalty 

MDSIC 

1 

0.5357 

0.3545 

-4228.1 

325.8 

26924 

2 

0.3390 

0.1815 

-3380.3 

315.7 

23022 

3 

0.2190 

0.1063 

-2924.9 

310.9 

19957 

4 

0.1280 

0.0669 

-1540.1 

317.5 

17343 

5 

0.0891 

0.0524 

-941.2 

330.5 

16120 

6 

0.0725 

0.0452 

-600.9 

326.7 

15510 

7 

0.0619 

0.0411 

-392.7 

330.3 

15236 

8 

0.0558 

0.0386 

-221.8 

330.5 

15173* 

9 

0.0547 

0.0372 

27.8 

367.6 

15282 

10 

0.0556 

0.0374 

7.7 

396.1 

15677 

11 

0.0600 

0.0375 

-17.3 

310.3 

16081 

12 

0.0637 

0.0374 

-21.2 

444.5 

16374 

Table  3:  Analysis  of  the  LLoyd  Bank  data.  The  minimum  MDSIC  is  marked  with  a  star. 
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Figure  5:  Fitted  and  observed  dissimilarities  for  the  LLoyd  bank  data.  The  red  dots  represent 
BMDS  and  the  green  dots  correspond  to  CMDS. 
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The  fourth  dimension  clearly  separates  two  outliers.  On  closer  inspection  of  the  data  it  turned  out 
that  these  were  both  individuals  who  had  very  short  careers  at  Lloyds.  They  spent  only  a  few  years 
there,  whereas  all  the  other  employees  were  at  Lloyds  for  at  least  ten  years. 

The  sociologists’  interest  in  these  data  is  primarily  to  characterize  the  typical  career  patterns  at 
Lloyds  in  this  period.  To  try  to  answer  this  question,  we  applied  model-based  clustering  (Banfield 
and  Raftery  1993;  Fraley  and  Raftery  1998)  to  the  BMDS  estimate  of  X,  after  removing  the 
two  clear  outliers.  This  models  the  data  as  a  mixture  of  multivariate  normals,  allowing  for  possible 
geometrically-motivated  constraints  on  the  covariance  matrices  of  the  different  groups.  The  number 
of  groups  and  the  clustering  model  are  both  chosen  using  approximate  Bayes  factors,  approximated 
via  BIC. 

Model-based  clustering  clearly  identified  three  groups.  These  are  shown  in  Figure  7,  which 
displays  the  first  two  components  of  the  BMDS  solution.  The  three  groups  selected  make  clear 
substantive  sense:  Group  1  consists  of  16  employees  who  had  shorter  careers  (22  years  or  less),  and 
spent  all  or  almost  all  of  their  career  at  the  lowest  clerk  rank.  Group  2  consists  of  30  employees 
with  long  careers  (40  years  or  more),  almost  all  of  whom  ended  their  careers  at  the  lowest  clerk 
level.  Group  3  consists  of  32  employees,  most  of  whom  were  promoted  and  ended  their  careers  as 
managers. 

6  Summary  and  Discussion 

In  this  paper,  we  have  proposed  a  Bayesian  approach  to  object  configuration  in  multidimensional 
scaling  and  a  simple  Bayesian  dimension  criterion,  MDSIC,  based  on  the  estimated  object  con¬ 
figuration.  Bayesian  MDS  provided  a  better  fit  than  classical  MDS  (Torgerson,  1952,1958)  in  all 
the  cases  we  tried.  The  improvement  in  performance  of  BMDS  is  more  pronounced  when  the  dis¬ 
similarities  are  different  from  Euclidean  distances  and  the  effective  dimension  is  ambiguous.  This 
sort  of  robustness  is  useful  in  practice,  since  in  applications  dissimilarities  are  often  not  Euclidean 
distances  and  the  concept  of  dimension  may  not  even  arise  in  their  formulation.  Another  consid¬ 
eration  is  that  one  may  often  want  to  use  two  or  three  dimensions  for  visual  display,  although  the 
true  dimension  may  be  much  higher.  The  proposed  dimension  selection  criterion,  MDSIC,  is  easy 
to  compute  and  gives  a  direct  indication  of  optimal  dimensionality.  An  advantage  of  MDSIC  is 
that  it  uses  the  BMDS  output,  which  seems  to  give  good  object  configurations  even  when  some  of 
the  model  assumptions  are  violated. 

A  key  feature  of  BMDS  is  that  when  the  dimension  increases,  the  coordinates  for  lower  dimen¬ 
sions  are  changed,  whereas  in  CMDS  the  coordinates  for  a  lower  dimension  are  always  a  subset  of 
those  for  a  higher  one.  The  coordinates  obtained  from  the  lower  dimensions  are  not  necessarily 
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Figure  7:  Lloyds  bank  Data:  First  two  BMDS  dimensions,  with  the  3-group  model-based  clustering 
classification  shown.  The  ellipses  show  the  one-standard-deviation  contours  of  the  densities  of  each 
of  the  component  multivariate  normal  distributions,  and  the  dotted  lines  show  their  principal  axes. 
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an  optimal  choice  when  the  dimension  increases,  and  retaining  them  in  higher  dimensions  may 
adversely  affect  the  performance  of  CMDS. 

One  common  reason  for  doing  MDS  is  to  cluster  the  objects.  In  our  third  example,  we  showed 
how  model-based  clustering  can  be  used  to  do  this,  providing  a  formal  basis  for  choosing  the 
number  of  groups.  The  results  were  substantively  reasonable  and  useful.  Combining  BMDS  and 
model-based  clustering  thus  provides  a  fully  model-based  approach  to  clustering  dissimilarity  data, 
including  ways  to  choose  the  dimension  of  the  data  and  the  number  of  groups. 

A  more  comprehensive  approach  to  this  problem  would  be  to  build  a  single  model  and  carry  out 
Bayesian  inference  for  it.  This  could  be  done  by  using  a  prior  distribution  of  X  that  is  based  on  a 
mixture  of  multivariate  normal  distributions,  rather  than  a  single  one  as  here.  Then  MCMC  could 
be  used  to  estimate  both  object  configuration  and  group  membership  simultaneously.  This  approach 
could  also  provide  a  way  of  choosing  the  dimension  and  the  number  of  groups  simultaneously,  rather 
than  sequentially,  as  we  did  in  our  example.  This  seems  desirable  because  there  may  be  a  tradeoff 
between  dimension  and  number  of  groups.  A  maximum  likelihood  approach  to  the  problem  of 
clustering  with  multidimensional  scaling  of  two-way  dominance  or  profile  data  was  proposed  by 
DeSarbo  et  al  (1991),  but  this  is  somewhat  different  from  the  present  context,  where  the  data  come 
in  the  form  of  dissimilarities. 

We  have  modeled  dissimilarities  as  being  equal  to  Euclidean  distances  plus  error.  This  cor¬ 
responds  to  metric  scaling,  and  so  our  approach  would  perhaps  best  be  called  Bayesian  metric 
multidimensional  scaling.  There  has  been  a  great  emphasis  in  the  MDS  literature  on  nonmetric 
scaling,  however.  In  nonmetric  scaling,  dissimilarities  are  modeled  as  equal  to  a  nonlinear  function 
of  distance.  This  could  be  incorporated  in  the  present  framework  by  replacing  (3)  by 

dij  ~  N(g(5ij),a2)  I{dij>  0),  *  /  j,z,  j  =  1,  ...,n,  (17) 

where  g(-)  is  a  nonlinear  but  monotonic  function.  One  could  postulate  a  parametric  model,  or  a 
family  of  parametric  models,  for  one  such  family  of  models  was  proposed  by  Ramsey  (1982).  Then 
standard  Bayesian  inference  via  MCMC  would  again  be  possible,  leading  to  Bayesian  nonmetric 
multidimensional  scaling. 

Apart  from  the  present  work,  we  do  not  know  of  any  other  Bayesian  analyses  of  multidimensional 
scaling  for  dissimilarity  data.  DeSarbo  et  al  (1999)  proposed  a  Bayesian  approach  to  multidimen¬ 
sional  scaling  when  the  data  are  in  the  form  of  binary  choice  data,  but  this  is  rather  different  from 
the  present  context,  where  the  data  take  the  form  of  dissimilarities. 
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APPENDIX 


Justification  of  (11) 

Integration  of  /i(cr2,X)  where 


h{a\X)  =  (a2) 


2\  —  (m/2+a+l) 


exp 


SSR/2  +  b_  ^lQg  $  /Sij 

O1  V  CT 

Oj 


is  not  straightforward.  However,  in  most  cases  m  =  n(n  —  l)/2  is  very  large  and  the  likelihood  of 
a2  dominates  the  prior  and  hence  h(a2,  X)  is  approximately  proportional  to  the  likelihood 


l (cr2,  X)  =  (cr2)  m/2  exp 

In  addition,  because  of  the  large  to,  the  likelihood  l (a2.  X)  is  well  approximated  by  a  normal 
density  function.  Thus,  applying  a  Laplace  approximation  to  the  integral  of  l(a2,X)  gives 

J  h(a2,X)da2  «  J  l(cr2,  X)da2  «  (2v r)1/2H^1/2/(X,  <j2),  (19) 

where  H  is  the  minus  Hessian  of  the  log  likelihood  and  <j2  is  the  MLE  of  cr2. 

We  now  argue  that  the  probability  4>(5jj/cr)  is  unlikely  to  have  much  effect  on  the  model 
comparison,  and  can  safely  be  ignored.  Suppose  we  are  comparing  dimension  p  with  dimension 
(p  +  1).  We  distinguish  between  two  situations.  Suppose  first  that  the  true  dimension  is  (p  +  1). 


—  -  Viog 

V  cr 

i>3 


2(7  2 


(18) 
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Then,  asymptotically,  the  term  (— SSR/2ct2)  will  dominate  the  exponent  on  the  right-hand  side  of 
(18),  dimension  (p+1)  will  be  preferred,  and  the  term  J2i>j  log  <I)(^,j/'T)  will  l,e  immaterial.  Second, 
suppose  instead  that  the  true  dimension  is  p.  Then  the  fitted  5{j  will  be  the  same,  asymptotically, 
for  dimension  p  as  for  dimension  (p  +  1),  and  so  the  term  Yli>j  log &(8ij/a)  will  be  the  same  for 
both  dimensions.  Thus  it  will  cancel  in  the  comparison,  and  can  again  be  ignored. 

Thus,  we  ignore  the  term  ■  log  <&(5ij/cr)  and  use  the  approximation 

Z(a2,X)  «;>2,X)  =  (a2)-™/2  exp[-^  . 

Replacing  /  by  1*  and  H  by  the  minus  Hessian  H*  of  1*  in  (19)  and  letting  a2  =  SSR/m  which 
maximizes  l*  gives  the  formula  (11). 
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